Taxonomy-driven lumping for sequence mining

ABSTRACT

Methods and apparatus are described for modeling sequences of events with Markov models whose states correspond to nodes in a provided taxonomy. Each state represents the events in the subtree under the corresponding node. By lumping observed events into states that correspond to internal nodes in the taxonomy, more compact models are achieved that are easier to understand and visualize, at the expense of a decrease in the data likelihood. The decision for selecting the best model is taken on the basis of two competing goals: maximizing the data likelihood, while minimizing the model complexity (i.e., the number of states).

BACKGROUND OF THE INVENTION

The present invention relates to data mining, and more specifically to modeling sequences of events arranged in a taxonomy.

Markov models are fundamental mathematical structures widely used in the natural and physical sciences, computer science, and engineering systems for describing and predicting processes. A Hidden Markov Model (HMM) is an extension of a Markov chain in which observable symbols are emitted in each of the states, but it is not possible to know exactly the current state from the symbol observed. Markov models with or without hidden states, first-order or higher-order, with or without lumping of states, have been extensively applied to sequence mining in the past.

The HMM parameters of a model can be adjusted by the well-known Baum-Welch method to increase the likelihood of observed sequences of symbols. Another class of techniques for learning HMM parameters from data is based on model merging. These approaches start with a maximum likelihood HMM that directly encodes all the observable samples. At each step, more general models are produced by merging previous simpler submodels. The submodel space is explored using a greedy search strategy and the states to be merged are chosen to maximize the data likelihood.

State aggregation in Markov models. Many of the processes that can be represented by HMMs suffer from the state space explosion problem. State space explosion occurs when the number of states grows too quickly for computation to solve more than trivial cases. As the number of states rapidly increases, computers run out of time and/or memory to complete the computation. For example, problems that grow exponentially or combinatorially with the size of the input suffer from state space explosion. As a result, minimizing memory requirements and time is crucial for most applications of HMMs. Aggregation techniques for reducing the number of states have been extensively studied.

Many approaches are based on the notion of lumpability, a property of Markov chains for which there exists a partition of the original state space into aggregated states such that the aggregated Markov chain maintains the characteristics of the original. A different approach reduces the structure of an HMM by partitioning the states using the bi-simulation equivalence, so that equivalent states can be aggregated in order to obtain a minimal set that does not significantly affect model performance. A simple heuristic for HMMs is to merge states that have the most similar emission probabilities. This approach has been applied to the domain of gesture recognition.

Sequence clustering. Sequence clustering is one of the most common tasks in sequence mining. This task has been handled by using frequent subsequences or n-gram statistics as features or by considering the edit distances among all the candidate sequences. Traditional methods often require sequence alignment and do not efficiently handle variable-length sequences.

One of the first works using HMMs for sequence clustering computed the pairwise distance matrix for all the observed sequences by training an HMM for each sequence. The log-likelihood of each model given the sequence is used to cluster the sequences in K clusters using an Expectation-Maximization (EM) algorithm. A Markov-chain based cluster method without hidden states using EM has also been implemented in commercial applications. In another approach, the HMMs are used as cluster prototypes. The clustering is computed by a combined approach of the HMMs and a rival-penalized competitive learning procedure. In an extension to the pairwise distance approach, HMMs are used to build a new representative space, where the features are the log-likelihoods of each sequence to be clustered with respect to a predefined number of HMMs trained over a set of reference sequences.

Sequence clustering can also be used for probabilistic user behavior models to describe and predict user actions. User actions are described by the conditional probability of performing an action given the previous action, plus binary features that indicate the presence of a certain action in the user's history.

Sequence mining applications. There are many applications of sequence mining. Two areas of interest are web usage mining and spatio-temporal data mining. Sequential pattern mining is one of the most common data mining techniques for Web data analysis. Markov models have been applied for modeling user web navigation sessions, describing user behavior, mining web access logs and for query recommendation. Mobility data analysis is a research area rapidly gaining a great deal of attention, as witnessed by the amount of spatio-temporal data mining techniques that have been developed in the last years.

Non-Markov based methods are also known in the art. Taxonomy-driven data mining has been mainly considered in the context of frequent pattern extraction: originally taxonomies were used for mining association rules and sequential patterns of itemsets in market-based data, where each item is a member of a hierarchy of product categories. More recently, taxonomy-based methods were used for mining frequent-subgraph patterns in biological pathways, where graphs of interacting proteins annotated with functionality concepts form a very large taxonomy.

SUMMARY OF THE INVENTION

According to the present invention, methods and apparatus are presented for modeling event data on a computer. The event data comprise sequences of symbols or events each representing an order of actions taken by a user. Events are grouped into a taxonomy which can be represented as a hierarchy or tree on the data, with each event mapping to a leaf node in the tree. A plurality of candidate Markov models are identified representing the probability of a user transitioning from a first node in the tree to any second node. Each Markov model comprises a subset of nodes in the taxonomy. Markov models are formed such that every event is represented by its corresponding leaf node or an ancestor node in the tree. Further, no Markov model contains both a node and an ancestor of that node. Additional Markov models are generated by merging selected nodes in the taxonomy into a corresponding ancestor node to limit the search space. The fitness of the candidate Markov models is measured with a fitness policy. Some of the candidate Markov models are selected with reference to the fitness measure and one or more resource constraints. A preferred Markov model is chosen according to an objective function balancing desired characteristics.

According to further embodiments, the event data are partitioned into multiple clusters. Each cluster is assigned a Markov model using the described process. These clusters are iteratively adjusted by reassigning each event sequence to the cluster whose preferred Markov model maximizes the objective function for that sequence. In some embodiments, the event data may comprise search queries submitted to a search engine, purchases on an online commerce site, locations on a map, pages visited on one or more websites, or user interactions with a software system. Advertisements may be selected for and displayed to a user based on a probability represented by a preferred Markov model.

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an example problem domain where embodiments of the invention may be practiced.

FIG. 2 presents a flowchart with an example process for practicing embodiments of the invention.

FIG. 3 illustrates a particular embodiment of merging adjacent nodes to create new model candidates.

FIG. 4 is a simplified diagram of a computing environment in which embodiments of the present invention may be implemented.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention.

Many real-world application domains are equipped with hierarchically-organized ontologies or taxonomies. The term “taxonomy” herein refers to any such hierarchical classification or grouping of elements. For example, the Linnaean taxonomy of organisms arranges species into hierarchical categories including kingdom, phylum, class, order, family, and genus. Often a taxonomy can be expressed as a tree. Taxonomies are useful for modeling data in such domains for at least two reasons. First, more compact, meaningful and understandable abstractions of the data can be produced by presenting it in terms of more general nodes in the taxonomy. Second, taxonomies can constrain the search space of data-mining algorithms, allowing more efficient and scalable techniques to be devised.

According to various embodiments of the present invention, sequences are modeled with Markov models whose states correspond to nodes in a provided taxonomy. Each state represents the events in the subtree under the corresponding node. By lumping observed events into states that correspond to internal nodes in the taxonomy, more compact models are achieved that are easier to understand and visualize, at the expense of a decrease in the data likelihood. The decision for selecting the best model is taken on the basis of two competing goals: maximizing the data likelihood, while minimizing the model complexity (i.e., the number of states).

The problem is formally defined and characterized, and a search method is given for finding a good trade-off between the two aforementioned goals. Unlike previous approaches, this approach introduces a natural constraint and leads to an efficient algorithmic solution.

The problem addressed herein can be distinguished from previous efforts. Given a set of events (or symbols, or items), a taxonomy tree on those events, and a dataset of event sequences, we study the problem of finding efficient and effective ways of producing a compact representation of the sequences. This representation is also used to cluster the sequences. It is worth noting the different perspective herein: in the present context, state aggregation or abstraction is done for the sake of useful and actionable knowledge modeling, and not only for reducing computational requirements.

According to embodiments of the present invention, the initial model is built starting from the relative frequencies of each sample, yielding a maximum likelihood model for the data. This model differs from previous approaches in two fundamental aspects due to the hierarchical structure of the input symbols. First, the natural constraints imposed by the data allow inferring the states directly from the emitted symbols, with the corresponding advantage of not having hidden states. Thus, it is worth noting that the model is not a Hidden Markov Model, avoiding downsides of that approach. Second, since the greedy search employed is “guided” by a hierarchical structure, two states can be merged only if they have the same parent in the taxonomy. In this way, the hierarchy-based merging drastically reduces the search space. In addition, instead of exploring all possible models, adjacent models are tested in each iteration.

Many different application domains fit within the described framework: in general, any problem regarding user (or customer) profiling, where the set of possible actions is hierarchically structured in a taxonomy. FIG. 1 depicts one example of such a domain. Taxonomy 100 includes search queries 101-107 arranged in a hierarchy of categories 111-115. Each query represents a search term entered by a user into a search engine such as Yahoo! Search. In this example, users have searched for pages on football, golf, poker, chess, 24, Heroes, and 30 Rock. Each term has been assigned to a category: Sports (football, golf), Games (poker, chess), and TV (24, Heroes, 30 Rock). These categories are in turn arranged into a hierarchy of further categories, with Sports and Games being grouped together under Recreation, which is grouped with TV under Entertainment. The data is depicted as a tree to illustrate these relationships. This depiction is largely conceptual; in practice, the data may be stored in any number of ways, including but not limited to trees, lists, arrays, tables, maps, and databases. For example, a sparse-matrix representation could be used for the sequence data. Further, taxonomies having any number of branches or levels, balanced or unbalanced, are contemplated by the invention.

In addition to a taxonomy, embodiments of the invention use Markov models to model transitions between states. These transitions are derived from a sequence of events. Sequence data 120 comprises sequences 121-125 of search terms entered by users. Each sequence contains search terms entered by a particular user in a particular order. For example, in sequence 124, the user first searched for the term “golf”, followed at a later time by a search for the term “chess”, and subsequently the terms “poker” and “Heroes”. While sequence 124 contains four search terms, in general a sequence may be any length. Similarly, any number of sequences may be used, not just the five sequences shown.

As another example, consider a large web site containing different pages and providing different services. The web site owner may be interested in profiling users with respect to their activity inside the site, for understanding which services and which parts of the site appear to be sequentially connected by users' activities. Events for the model would be the individual pages or services of the site which the users access. Each sequence would be the order in which a particular user visited those pages or services. The taxonomy could be the path hierarchy for the URLs visited.

As yet another example, consider the interaction between a software system (such as Yahoo! Messenger) and its users. The software system may record user activities, and then the software developers may want to analyze these activity traces in order to understand how users interact with the system and how to improve it. The possible user actions are commands which are naturally organized in a hierarchy, e.g. in the different toolbars and menus of the software system.

As a final example, embodiments of the invention can be applied to model trajectories on a map. A trajectory in this context is a sequence of locations visited. It may be represented as discrete points corresponding to individual locations (e.g. gas station, grocery store, home, etc) or a continuous line along the route traveled, among other possibilities. The map may represent a physical location such as the San Francisco Bay Area, the city of Oldenburg, or the solar system. It may also map virtual, logical, or fictional locations, such as a map of the internet, a social networking service, or Gotham City.

According to certain embodiments, sequences may be obtained from trajectories by dividing the map into a grid. Each square in the grid represents one event or symbol for the model, and each trajectory defines a sequence of grid squares visited. To create a hierarchy over the points, a tree is created by taking an area covering the whole map as the root, and then recursively dividing this area until the full grid is reached. For example, a 32×32 grid creates a tree with 1,024 leaves, one for each grid square. A taxonomy is created by dividing the root node (i.e. the entire map) into quadrants, and then recursively dividing each section into quadrants until the 32×32 grid is reached. This produces a taxonomy of six levels (the root, followed by grids of 2×2, 4×4, 8×8, 16×16, and 32×32 squares).
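The quadrant subdivision described above can be sketched as follows. This is a minimal illustration, not the specification's own implementation: the function names, the (level, row, column) node encoding, the unit-square coordinates, and the choice to drop immediately repeated grid squares are all assumptions made for the example.

```python
def quadtree_path(row, col, depth=5):
    """Return the taxonomy nodes from the root down to one leaf grid square.

    Each node is encoded (hypothetically) as (level, row, col) on a
    2**level x 2**level grid, so depth=5 corresponds to a 32x32 leaf grid.
    """
    size = 2 ** depth
    if not (0 <= row < size and 0 <= col < size):
        raise ValueError("grid square lies outside the map")
    path = []
    for level in range(depth + 1):
        shift = depth - level  # how many subdivisions remain below this level
        path.append((level, row >> shift, col >> shift))
    return path


def trajectory_to_sequence(points, depth=5):
    """Map (x, y) points in the unit square to a sequence of leaf grid squares.

    Assumption for this sketch: consecutive points that fall in the same
    square are collapsed into a single event.
    """
    size = 2 ** depth
    sequence = []
    for x, y in points:
        cell = (min(int(y * size), size - 1), min(int(x * size), size - 1))
        if not sequence or sequence[-1] != cell:
            sequence.append(cell)
    return sequence
```

For a 32×32 grid, `quadtree_path` returns six nodes for every leaf, one per taxonomy level, which matches the six-level taxonomy described in the text.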

Contexts in which the framework is useful include two applications described herein: (i) query log mining, where user queries are classified in a topical taxonomy, and (ii) mining trajectories of moving objects, where the hierarchy is given by the natural spatial proximity. In all of the above-mentioned applications, automatically defining the most appropriate level of granularity to represent the information is challenging.

According to the present invention, the modeling representation adopted for a collection of sequences is a Markov model. The states of the model are nodes in the taxonomy, where the last level (leaves) contains the observable symbols corresponding to events. Upon visiting each state, the Markov model emits one symbol, which can be any leaf in the sub-tree under the node corresponding to that state. Although this model involves transitions and emissions, the model is not a Hidden Markov Model. That is, due to the one-to-many mapping from states to symbols in the model, we can always recover exactly the state that emits each symbol. There are no truly hidden states.

By using internal nodes in the taxonomy tree to represent the Markov states, the likelihood of the data given the model is decreased with respect to a model created at the leaf level. The higher we go in the taxonomy, the more the likelihood decreases, but the greater the simplicity of the obtained models. This helps make the models more general, more meaningful for the domain experts, and easier to visualize. The decision between alternative models is based on the following competing criteria: (i) the data likelihood should be maximal, and (ii) the model complexity should be minimal.

The approach of using (non-hidden) Markov states to represent disjoint subsets of symbols is inspired by the fundamental ideas of lumpability and approximate lumpability in Markov chains. In addition to lumping, the emission probabilities of the symbols at the lumped states are used to express the total likelihood of the data. Since the model does not have hidden states, we can bypass the computational challenges of HMMs, such as the estimation of model parameters using the Baum-Welch algorithm. In the present model, the parameters that yield the maximum likelihood are given by simple frequency counts. The computation of the likelihood of a sequence given a model is also done by counting, and not by using a sequence decoder such as the Viterbi algorithm.

A further advantage of Markov models is that they can be directly used for visualization of the mining results. For instance, in a trajectory clustering application, the nodes of the Markov model can be laid directly over the geographic map, to represent areas with the associated transition probabilities among them.

Once a Markov model is chosen to represent the data, the probability information it provides can be used for a variety of purposes. For instance, the model may predict that a user who searches for a term in the category Sports has a 70% chance of next searching for a term in the category Games. This prediction can be used to present relevant information to the user, such as displaying ads for poker websites to a user entering the search term “football”.

The following mathematical notation will be used to further describe certain aspects of the invention. The data to be modeled can be denoted as a set of r sequences D={σ₁, . . . , σ_(r)} over a set of symbols Σ={α₁, . . . , α_(m)}. A special meaning is assigned to two of the symbols of Σ: α₁ denotes the starting symbol and α_(m)=□ denotes the terminal symbol. The first symbol α_(j1) of every sequence σ_(j) is the starting symbol α₁, and the last symbol α_(jl_(j)) of every sequence is the terminal symbol □. No other symbol is either starting or terminal. The symbols of Σ form a taxonomy T, which is simply a tree whose leaf nodes are the symbols of Σ.

For any symbol α ∈ Σ, let c_(α) denote the total number of times that α appears in all sequences. Hence c_(α₁)=c_(□)=r because there are r sequences and every sequence begins and ends with the starting and terminal symbols, respectively. This notation is extended by using c_(αβ) to denote the total number of times that symbol β follows symbol α in all sequences.

Consider a set of s states X={x₁, . . . , x_(s)}. A Markov model on X is defined by transition probabilities p_(x,y), for each pair of states x,y ∈ X, which determine the probability that the next state will be y given that the current state is x. For all x ∈ X, it holds that Σ_(y∈X) p_(x,y)=1.

Embodiments of the invention utilize a Markov model M with s>2 states X={x₁, . . . , x_(s)}, where each state x ∈ X corresponds to a set of symbols A(x)⊂Σ. Further, {A(x)} forms a partition of Σ. In mathematical terms, this means that ∪_(x∈X)A(x)=Σ and A(x)∩A(y)=Ø for all distinct x,y ∈ X. This partition will be in practice a cut in the taxonomy tree. The cut ensures that each state is represented by a single node from the taxonomy in the Markov model. For any given node in the Markov model, no ancestor nodes in the taxonomy will appear in the partition (because the A(x) sets are disjoint) or the resulting Markov model.
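The cut condition above can be checked mechanically: every leaf must have exactly one node of the model on its path to the root. The sketch below assumes the FIG. 1 taxonomy stored as a hypothetical child-to-parent dictionary named `PARENT`, and it leaves out the special starting and terminal states for brevity.

```python
# Child -> parent map for the FIG. 1 taxonomy (the dictionary layout is an
# assumption of this sketch, not a storage format given in the specification).
PARENT = {
    "Recreation": "Entertainment", "TV": "Entertainment",
    "Sports": "Recreation", "Games": "Recreation",
    "football": "Sports", "golf": "Sports",
    "poker": "Games", "chess": "Games",
    "24": "TV", "Heroes": "TV", "30 Rock": "TV",
}


def leaves(parent):
    """Leaf nodes are the nodes that never appear as somebody's parent."""
    internal = set(parent.values())
    return {node for node in parent if node not in internal}


def ancestors_or_self(node, parent):
    """The node itself followed by its ancestors up to the root."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def is_cut(states, parent):
    """True if every leaf is covered by exactly one node in `states`."""
    states = set(states)
    return all(
        sum(node in states for node in ancestors_or_self(leaf, parent)) == 1
        for leaf in leaves(parent)
    )
```

Under these assumptions, `{"Sports", "Games", "TV"}` and `{"Recreation", "24", "Heroes", "30 Rock"}` are valid cuts, while `{"Recreation", "Sports", "TV"}` is rejected because it contains both a node and its ancestor.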

The states x₁ and x_(s) are special states used for denoting the starting and terminal symbols. No other symbols are assigned to those states. Hence A(x₁)={α₁} and A(x_(s))={□}. Conversely, for a symbol α let x(α) denote the unique state to which α is assigned, i.e. α ∈ A(x(α)). To give correct interpretation to the starting state x₁ and the terminal state x_(s), we assume that for all x ∈ X, p_(x,x₁)=0, p_(x₁,x_(s))=0, and p_(x_(s),x_(s))=1.

For understanding certain aspects of the invention, a Markov model is denoted by M=(X, A, p, q), where X is the set of states, A is the function mapping states to sets of symbols, and where p and q are the vectors containing all the transition and emission probabilities, respectively.

Describing the Markov model in terms of transition and emission probabilities evokes comparisons to a hidden Markov model. However, as previously mentioned, since each symbol in our model corresponds to a unique state, there are no hidden states. Given a Markov model M=(X, A, p, q) as described above, another Markov model can be defined M′=(Σ, r), whose set of states is the set of symbols Σ. M and M′ are equivalent, and M′ is a proper first-order Markov model with no hidden states.

Lemma 1. Given a model M=(X,A,p,q) with transition and emission probabilities there is an equivalent Markov model M′=(Σ, r) with no hidden states, where

r_(α,β) = p_(x(α),x(β)) q_(x(β),β).

Likelihood of a dataset. Given a dataset of sequences D={σ₁, . . . , σ_(r)}, and a Markov model M=(X,A,p,q), the likelihood of the data given the model is computed as

$L(D \mid M) = \prod_{\alpha,\beta \in \Sigma} \left( p_{x(\alpha),x(\beta)} \right)^{c_{\alpha\beta}} \prod_{\alpha \in \Sigma} \left( q_{x(\alpha),\alpha} \right)^{c_{\alpha}}$

The first product is due to transitions among states, and the second product is due to emissions of symbols. To avoid numerical underflow it is convenient to work with the minus log-likelihood, which is

$\begin{matrix}{S_{L}(D \mid M) = - \sum_{\alpha,\beta \in \Sigma} c_{\alpha\beta} \log p_{x(\alpha),x(\beta)} - \sum_{\alpha \in \Sigma} c_{\alpha} \log q_{x(\alpha),\alpha}} & (1)\end{matrix}$

Maximum-likelihood estimation. Consider the input sequences D={σ₁, . . . , σ_(r)} and a Markov model M in which only the states X and the state-to-symbol mapping A have been specified. The task is to compute the transition and emission probabilities, p and q, so that the S_(L)(D|M) function is minimized. It is well known that the maximum-likelihood probabilities are estimated as the observed frequencies

$\begin{matrix}{\bar{p}_{x,y} = \frac{c_{xy}}{c_{x}}} & (2)\end{matrix}$

and

$\begin{matrix}{\bar{q}_{x,\alpha} = \begin{cases} c_{\alpha}/c_{x} & \text{if } \alpha \in A(x) \\ 0 & \text{otherwise,} \end{cases}} & (3)\end{matrix}$

where c_(xy)=Σ_(α∈A(x),β∈A(y)) c_(αβ) is the total number of times that a symbol of state x is followed by a symbol of state y, and c_(x)=Σ_(α∈A(x)) c_(α) is the total number of times that a symbol of state x appears in the data. This leads to the following.

Observation 1. Given a set of sequences D, and a Markov model M=(X,A,·,·) for which only the states are pre-specified, the optimal score of the minus log-likelihood function is given by

$\begin{matrix}{S_{L}^{*}(D \mid M) = - \sum_{x,y \in X} c_{xy} \log \frac{c_{xy}}{c_{x}} - \sum_{\alpha \in \Sigma} c_{\alpha} \log \frac{c_{\alpha}}{c_{x(\alpha)}}} & (4)\end{matrix}$

where c_(α), c_(x) and c_(xy) are as defined above.
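Because the maximum-likelihood parameters are plain frequency ratios, the score of Equation (4) can be computed in a single counting pass. The following sketch assumes natural logarithms, a hypothetical symbol-to-state dictionary `state_of`, and the marker strings `"<s>"` and `"</s>"` as stand-ins for the starting and terminal symbols; none of these names come from the specification.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical stand-ins for the start/terminal symbols


def optimal_score(sequences, state_of):
    """Minus log-likelihood of Equation (4), computed purely by counting.

    `state_of` maps a symbol to its state; any symbol missing from the
    mapping (including START and END) is treated as its own state.
    """
    c_sym, c_state, c_trans = Counter(), Counter(), Counter()
    for seq in sequences:
        full = [START, *seq, END]
        states = [state_of.get(a, a) for a in full]
        c_sym.update(full)                        # c_alpha
        c_state.update(states)                    # c_x
        c_trans.update(zip(states, states[1:]))   # c_xy
    transitions = -sum(
        c * math.log(c / c_state[x]) for (x, _y), c in c_trans.items()
    )
    emissions = -sum(
        c * math.log(c / c_state[state_of.get(a, a)]) for a, c in c_sym.items()
    )
    return transitions + emissions
```

As a small check with two made-up sequences, (golf, chess) and (football, poker), the leaf-level model scores 2 ln 2 ≈ 1.386, while lumping the symbols into Sports and Games scores 4 ln 2 ≈ 2.773: the lumped model is simpler but has a higher minus log-likelihood.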

The next question is to find the model that yields the highest likelihood for a dataset. That is, among all possible mappings (partitions) of the symbol set Σ to states, find the partition that gives a model M that minimizes the score in Equation (4). This can be done as follows.

Lemma 2. The Markov model that minimizes Equation (4) is the “leaf-level” model with |Σ|=m states X={x₁, . . . , x_(m)}, where x_(i)={α_(i)}.

Lemma 2 is a direct consequence of the following more general fact.

Lemma 3. Consider Markov models M₁ and M₂ for which for every state x of M₁ and every state y of M₂ it is either A(x)⊂A(y) or A(x)∩A(y)=Ø. In other words, the states of M₁ are a sub-partition of the states of M₂. Then

S_(L)*(D|M₁) ≦ S_(L)*(D|M₂)

Comparing the minus log-likelihood scores of the two models directly is not straightforward. Some terms of the difference are positive while other terms are negative, and it is not easy to compare them. Luckily, tools provided by information theory can be used.

Intuitively, we want simple models with a small number of states, since such models are more useful to understand the data, and they avoid overfitting. However, as the previous Lemma shows, there is a trade-off between likelihood and simplicity of the model. A simpler problem to consider is finding the best model with a given number of states.

Problem 1 (k-state-optimal model). Given a set of sequences D={σ₁, . . . , σ_(r)} and a number k, find a Markov model M that has at most k states and minimizes the score S_(L)*(D|M).

However, the constraint of using k states might be too stringent and in many cases we may not know which is the correct number of states. We would like to have an objective function that balances the likelihood score and the number of states. This is a typical model selection problem, and many different approaches have been proposed, including minimum-description length (MDL) criteria, Bayesian information criterion (BIC), cross-validation methods, etc. BIC does not perform well for the size of the data contemplated. Essentially, for large data, the logarithmic factor of the BIC formula is orders of magnitude smaller than the minus log-likelihood score, and thus there is no sufficient penalization for model complexity.

Certain embodiments of the invention use a model-selection objective in which the minus log-likelihood and the model complexity are considered together. The task is to find the model that is as close as possible to a model with an ideal score. Finding the ideal-scoring model itself is not feasible in many cases due to the state space explosion problem. Instead, a number of candidate models are evaluated using a fitness function to find the closest fit within a limited budget of resources (e.g., time, computing power, energy usage, etc). Let s_(min) be the minimum possible number of states, attained by the model that achieves the maximum possible score S_(L,max)*. Let M_(m) be the model with the maximum possible number of states s_(max)=m, which achieves the minimum possible score S_(L,min)*. Then, for a model M with s states that achieves score S_(L)*=S_(L)*(D|M), the objective function is defined as

$\mathrm{Dist}^{2}(M) = \left( \frac{s - s_{\min}}{s_{\max} - s_{\min}} \right)^{2} + w \cdot \left( \frac{S_{L}^{*} - S_{L,\min}^{*}}{S_{L,\max}^{*} - S_{L,\min}^{*}} \right)^{2}$

The parameter w is a scale factor controlling the relative importance of the two terms of the objective function. Values of w such as 1 or 10 have been found to work in practice. The corresponding model-selection problem for the objective function can be restated as:
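A direct transcription of the Dist² objective is given below as a sketch. The function and parameter names are illustrative assumptions, and the normalization endpoints (the extreme state counts and scores) are taken as inputs because how they are obtained is not specified here.

```python
def dist_sq(s, score, s_min, s_max, score_min, score_max, w=1.0):
    """Squared distance of a model from the ideal point (s_min, score_min).

    Both terms are normalized to [0, 1]; `w` scales the likelihood term.
    Assumes s_max > s_min and score_max > score_min (no degenerate ranges).
    """
    states_term = (s - s_min) / (s_max - s_min)
    score_term = (score - score_min) / (score_max - score_min)
    return states_term ** 2 + w * score_term ** 2
```

With these definitions the leaf-level model (s = s_max, best score) always evaluates to 1, the most lumped model (s = s_min, worst score) evaluates to w, and intermediate models fall in between, so a larger w penalizes a loss of likelihood more heavily relative to the number of states.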

Problem 2. Given a set of sequences D={σ₁, . . . , σ_(r)}, find a Markov model M that minimizes the objective Dist²(M).

In some instances, limiting the number of states may be desirable. A hybrid objective function combining an upper bound on a desirable number of states with minimizing the objective Dist²(M) may be stated as:

Problem 3. Given a set of sequences D={σ₁, . . . , σ_(r)} and a number k, find a Markov model M that has at most k states and minimizes the score Dist²(M).

FIG. 2 presents a flowchart illustrating an example of a process implemented in accordance with a specific embodiment of the invention. The scenario of FIG. 1 is assumed, i.e., modeling a collection of search terms 201 and search query sequences 202 within a taxonomy of categories 203. Reference to elements of FIG. 2 will be made throughout the following general description. Some elements are simplified or omitted in FIG. 2 for clarity.

A method for modeling sequences of symbols (or events, or other data) on a computer with a Markov model takes as input a hierarchy T on the symbol set Σ and a set of sequences of elements from Σ (corresponding to data elements 201-203 of FIG. 2). The method uses a “bottom-up” search algorithm with the following components: (i) an objective function g; (ii) a fitness policy p; (iii) a priority queue Q of candidate models to evaluate; (iv) a set E of models already evaluated.

The objective function g may be drawn from any of a family of objective functions {g} defined as g: ℝ × ℕ → ℝ, where the first argument represents a minus log likelihood score and the second argument represents the number of states of a model. For illustrative purposes, the objective functions used as examples herein are the ones defined in Problems 1, 2, and 3. For Problems 1 and 3, if a model has more than k states then g returns the value ∞. Numerous other functions for g are possible, as appreciated by those skilled in the art.

The fitness policy p is a total ordering on pairs (v, s) in ℝ × ℕ, where again v represents a minus log likelihood score and s represents the number of states of a model. Let M_(i) and M_(j) be two models to be compared, with scores v_(i), v_(j) and numbers of states s_(i), s_(j), respectively. For example, the three following fitness policies could be used.

(i) ProbabilityFirst: p(v_(i), s_(i))>p(v_(j), s_(j)) if and only if v_(i)<v_(j) or (v_(i)=v_(j) and s_(i)<s_(j));

(ii) StatesFirst: p(v_(i), s_(i))>p(v_(j), s_(j)) if and only if s_(i)<s_(j) or (s_(i)=s_(j) and v_(i)<v_(j));

(iii) DistSqFirst: p(v_(i), s_(i))>p(v_(j), s_(j)) if and only if g(v_(i), s_(i))<g(v_(j), s_(j)).

In the first case, the policy favors minimizing the minus log-likelihood over reducing the complexity of the model. The converse is done in the second case. In the last case, the policy balances the likelihood score and the number of states by selecting the model that minimizes the objective function g. Limitless other fitness policies will be readily apparent to those skilled in the art.
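One way to realize the three example policies is as sort keys, where a smaller key means a preferred model. This is only a sketch: the function names and the tuple-key encoding are assumptions, and the third policy takes whichever objective function g is in use as a parameter.

```python
def probability_first(v, s):
    """Prefer the lower minus log-likelihood; break ties by fewer states."""
    return (v, s)


def states_first(v, s):
    """Prefer fewer states; break ties by the lower minus log-likelihood."""
    return (s, v)


def dist_sq_first(g):
    """Build a policy that prefers the lower value of the objective g(v, s)."""
    def key(v, s):
        return (g(v, s),)
    return key


def best(candidates, policy):
    """Pick the preferred (v, s) pair from an iterable of candidates."""
    return min(candidates, key=lambda pair: policy(*pair))
```

For instance, among the hypothetical candidates (10.0, 5), (12.0, 3), and (10.0, 4), `probability_first` selects (10.0, 4) while `states_first` selects (12.0, 3). Note that the third policy leaves ties in g unresolved, so an implementation may want a secondary tie-breaker.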

The method initially puts in the queue Q the “leaf-level” model (213), i.e., the model with states X={x₁, . . . , x_(m)}, where x_(i)={α_(i)}. This is the model composed of states corresponding to each symbol in Σ, with no higher level (non-leaf) nodes from the taxonomy (212). Starting from this model, the method iteratively searches for other models which yield better results according to the fitness policy.

A maximum number of iterations is defined corresponding to the amount of resources to be invested in creating the model (219). These resources may comprise a period of time to run computations, a number of computing cycles, an amount of processing power or energy to use, or any other resource constraints as appreciated by those skilled in the art.

In each iteration (214-218), as long as the queue Q is not empty and the maximum number of iterations has not been reached (219), the model M with the best score (214) according to the policy p is removed from the queue (215). The log-likelihood score v=S_(L)*(D|M) is evaluated for M, and the number of states of M is assigned to s. The model M is inserted in the set E of evaluated models, along with the score g(v, s).

In addition, all models M_(i) that result from M with one merging step according to the hierarchy T are generated (217). A state-merging step is performed by merging the children (immediate descendants) of a single node x in T to x (216). If M_(i) has not already been evaluated (check in E), then M_(i) is put in the queue of candidates Q (218), according to the ordering p(v, s).

When the algorithm terminates, it returns the model with the best score g (220), among all models that have been evaluated and put in the set E. How this model is used depends on the context. For example, in the context of FIG. 2 the model may be used to predict the next category a user will visit (221). This information can be used for a variety of purposes, including the aforementioned example of selecting advertisements to display for the user.
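The search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: a model is represented as a cut (a frozenset of taxonomy node names), the taxonomy as a dictionary mapping each internal node to its children, and `score`, `g` and `policy_key` are caller-supplied functions whose names are hypothetical. Candidates are queued with the score of the model they were derived from, and a smaller key denotes the preferred candidate.

```python
import heapq
import itertools


def single_merges(cut, children):
    """Yield every cut obtained from `cut` by merging all children of one node."""
    for node, kids in children.items():
        if kids and all(k in cut for k in kids):
            yield (cut - frozenset(kids)) | {node}


def bottom_up_search(leaves, children, score, g, policy_key, max_iters):
    start = frozenset(leaves)
    tie = itertools.count()  # tie-breaker so the heap never compares cuts
    # The leaf-level model is the only initial entry, so its key is arbitrary.
    queue = [(policy_key(0.0, len(start)), next(tie), start)]
    queued = {start}
    evaluated = {}  # the set E: cut -> g(v, s)
    iters = 0
    while queue and iters < max_iters:
        _, _, cut = heapq.heappop(queue)
        v, s = score(cut), len(cut)
        evaluated[cut] = g(v, s)
        for cand in single_merges(cut, children):
            if cand not in evaluated and cand not in queued:
                # Queue the candidate with the score of the model it came from.
                key = policy_key(v, len(cand))
                heapq.heappush(queue, (key, next(tie), cand))
                queued.add(cand)
        iters += 1
    if not evaluated:
        return None
    # Return the evaluated model with the lowest objective value.
    return min(evaluated, key=evaluated.get)
```

In this sketch the iteration cap plays the role of the resource constraint, and the priority queue is only one of the data structures that could hold the candidates.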

In some embodiments, each candidate model is inserted in Q with the minus log likelihood score of its parent, which is a lower bound on its own score. This optimization is faster than computing the likelihood score of each candidate. It should also be noted that queue Q is merely a convenient technique for describing the process. Q can be implemented with many different data structures other than queues, and the process can be structured in numerous other ways without Q to achieve the same result.

FIG. 3 illustrates a particular embodiment of merging adjacent nodes to create new model candidates. Tree 301 depicts the leaf-level model or “cut” 305 used as the initial model in certain embodiments of the invention. Nodes 311-317, the leaves of the tree, which represent search terms entered by a user, are selected for this cut. A Markov model is then constructed using those nodes as states. This corresponds to the first candidate Markov model evaluated. Other candidates are generated from tree 301 according to the described state-merging algorithm. Each set of adjacent (sibling) nodes in cut 305 is merged into their parent node to create another cut of the tree.

Tree 302 shows the cut 306 formed by merging nodes football 311 and golf 312 into their parent node Sports 323. The other nodes in cut 306 are inherited from cut 305, that is, nodes 313-317. Cut 306 thus consists of nodes 323 and 313-317. Similarly, cut 307 merges sibling nodes poker 313 and chess 314 into the parent node Games 324, creating a cut consisting of nodes 311-312, 324, and 315-317. Finally, cut 308 merges the nodes under TV 325 to create a cut comprising nodes 311-314 and 325. Thus cuts 306-308 represent the three possible cuts formed by a single state-merging operation on cut 305. Each cut 306-308 may be used to create a candidate Markov model for the next iteration.
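The single-merge rule applied to the tree of FIG. 3 can be illustrated with a short Python sketch. The node names for Sports, Games, Recreation, TV and their named leaves follow the description; the root label and the names of the three leaves under TV (nodes 315-317) are placeholders assumed for this sketch, as is the function name.

```python
# Taxonomy of FIG. 3 as a parent -> children map.
CHILDREN = {
    "Root": ["Recreation", "TV"],           # hypothetical root label
    "Recreation": ["Sports", "Games"],
    "Sports": ["football", "golf"],
    "Games": ["poker", "chess"],
    "TV": ["tv_315", "tv_316", "tv_317"],   # placeholder names for leaves 315-317
}


def one_merge_cuts(cut):
    """Map each mergeable parent to the cut produced by merging its children."""
    cut = frozenset(cut)
    result = {}
    for parent, kids in CHILDREN.items():
        # A merge is allowed only when every child of the parent is in the cut.
        if cut.issuperset(kids):
            result[parent] = (cut - set(kids)) | {parent}
    return result


leaf_cut = {"football", "golf", "poker", "chess", "tv_315", "tv_316", "tv_317"}
merged = one_merge_cuts(leaf_cut)  # keys: "Sports", "Games", "TV"
```

Starting from the leaf-level cut, only Sports, Games and TV qualify for a merge, matching cuts 306-308. Recreation becomes mergeable only once both Sports and Games are already in the cut.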

Further cuts are not possible with a single merge operation. For example, the nearest common ancestor of golf 312 and poker 313 is the category Recreation 322. But Recreation is not the parent node of either golf or poker; Recreation is a more distant ancestor, separated from golf and poker by the intervening nodes Sports 323 and Games 324. The cut combining golf and poker thus requires three merge operations to reach from cut 305: one merge of football and golf into Sports, another merge of poker and chess into Games, and a third merge of Sports and Games into Recreation. Embodiments using the single-merge rule to identify cuts based on cut 305 would not consider this cut.

However, the same cut may be reached within these rules through other paths. For example, a cut comprising nodes Recreation 322 and TV 325 may be created by a single merge operation on a cut comprising nodes Sports 323, Games 324, and TV 325. Other embodiments which do not employ a one-merge rule may perform such merges directly from the leaf-level cut 305 or other cuts. Whether these cuts are actually reached during operation of an embodiment depends on how many iterations are performed and the fitness measures of preceding cuts along the path from the starting cut.

Without the constraint imposed by the taxonomy, the search space of the method would be the set of all possible partitions of the symbol set Σ, which is exponentially large. Performing merges only along the hierarchy of nodes reduces the search space dramatically. Nevertheless, even with the use of a hierarchy, the search space has exponential size. In practice, it is impossible to explore it completely for all but toy-size datasets. Finding a good solution therefore depends mostly on using a fitness policy that reaches one quickly. For a given fitness policy, it is also very important to evaluate candidate models quickly, so that a large number of candidate models can be evaluated per unit of time.

The costly part of evaluating the objective function of a model is computing the S_(L)* score. For computing S_(L)*, only the counts c_(α) and c_(αβ) are needed. The input dataset D therefore only has to be read once to summarize all the needed information in the counts c_(α) and c_(αβ), after which every evaluation is performed using those counts. In fact, assuming a sparse representation of the matrix c_(αβ), its density cannot be higher than n/m², where n is the cumulative length of all sequences and m is the size of the symbol set Σ. The amount of space (and time) needed for the evaluation of each model is O(min{n,m²}).
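The single pass over the data and the count-based evaluation can be sketched as follows. The counting step follows the description directly. The scoring step is only an assumption: it takes the score to be the maximum-likelihood log-likelihood of a lumped first-order chain, with a transition term c_(xy)/c_(x) per observed transition and an emission term c_(α)/c_(x) per observed symbol, and it ignores initial-state probabilities. The function names and the `state_of` mapping are hypothetical.

```python
import math
from collections import Counter


def count_symbols(sequences):
    """One pass over the data: symbol counts c_a and adjacent-pair counts c_ab."""
    c1, c2 = Counter(), Counter()
    for seq in sequences:
        c1.update(seq)
        c2.update(zip(seq, seq[1:]))
    return c1, c2


def lumped_log_likelihood(c1, c2, state_of):
    """Assumed score for a lumping given by state_of (symbol -> state)."""
    cx, cxy, out = Counter(), Counter(), Counter()
    for a, n in c1.items():
        cx[state_of[a]] += n                    # occurrences of each state
    for (a, b), n in c2.items():
        cxy[state_of[a], state_of[b]] += n      # state-to-state transitions
        out[state_of[a]] += n                   # transitions leaving a state
    # Transition term: each transition x -> y has probability c_xy / c_x.
    ll = sum(n * math.log(n / out[x]) for (x, _), n in cxy.items())
    # Emission term: each symbol a in state x has probability c_a / c_x.
    ll += sum(n * math.log(n / cx[state_of[a]]) for a, n in c1.items())
    return ll
```

Only the sparse dictionaries of counts are touched during scoring, so the raw sequences are not re-read when a new lumping is evaluated.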

A faster model evaluation can be done incrementally, by computing the score of a model with respect to the S_(L)* score of its “parent” model, which has already been evaluated as a by-product of the bottom-up algorithm. In particular, consider a model M₁ with state space X, and a child model M₂ which is built from M₁ by merging a subset of d states of M₁ into a single state in M₂. Observation 1 allows the difference in the S_(L)* score of the two models to be expressed in terms only of the counters c_(xy) that involve the states x, y participating in the merging, without computing over the whole frequency matrix c. The evaluation of each child model can be done in time O(min{n,md}) given that the score of the parent model is known.

The method can be extended in accordance with specific embodiments of the invention to identify clusters of sequences. A cluster in this context is a subset of sequences with similar characteristics. Creating separate models for each cluster may yield better results with higher predictive value. For example, suppose the sequence data represents search queries as illustrated in FIG. 1. Students using the search engine may commonly search for TV shows and games such as the examples given. Meanwhile, working professionals may often search for TV shows and sports. Combining sequence data from both groups lowers the predictive value of the model. The next likely category from a user searching for TV shows may vary greatly depending on which cluster the user belongs to. Students may be more likely to search for games while professionals are more likely to search for sports. Identifying distinct clusters of users and creating separate Markov models for each improves results. If the user can be assigned to either the student or professional cluster, the user's subsequent searches can be predicted with higher accuracy.

However, the problem faced is that these clusters are not known in advance. Given only sequences of events, identifying clusters within the sequences is a difficult problem. How many clusters should be created? What criteria should separate the clusters? For example, are the searches of retirees more similar to students or professionals? Or do they belong in their own cluster?

According to various embodiments, different techniques for identifying clusters may be used. For example, a standard Expectation-Maximization (EM) cluster identification technique can be used, in which each cluster is modeled using the Markov modeling methods described herein. Combining these techniques and methods yields a process for more accurately modeling users who fit within distinct groups.

Let k be the number of clusters to be obtained; the clustering will be a partition of the sequences D into k sets D₁, D₂, . . . , D_(k). To initialize the method, an initial partition is made. For example, a random initial assignment can be performed, but other initialization procedures can also be used. The method then proceeds iteratively. On each iteration: first, using the method previously described in this section, the model M_(i)* that maximizes the likelihood score,

$\mathcal{M}_{i}^{*} = \arg\max_{\mathcal{M}} S_{L}(\mathcal{D}_{i} \mid \mathcal{M}),$

is found for each subset D_(i). Next, the whole set of sequences D={σ₁, . . . , σ_(r)} is scanned. For each sequence σ_(j), the index k_(j)* that maximizes the likelihood of that particular sequence is found by

$k_{j}^{*} = \arg\max_{k_{j} \in \{1, 2, \ldots, k\}} S_{L}(\{\sigma_{j}\} \mid \mathcal{M}_{k_{j}}^{*}).$

Finally, the sequences are re-partitioned such that sequence σ_(j) ∈ D_(k_(j)*). Each sequence is reassigned to the partition whose model gives the maximum likelihood for that sequence. The process stops when the fraction of elements reassigned becomes negligible.
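The iterative re-partitioning can be sketched in Python as follows. To keep the sketch self-contained, the per-cluster model search described earlier is replaced by a plain first-order chain over the symbols with add-one smoothing; this stand-in, the function names, and the parameters `max_rounds`, `tol` and `seed` are assumptions made for illustration and are not part of the described embodiments.

```python
import math
import random
from collections import Counter


def fit_chain(seqs, alphabet):
    """Stand-in for the per-cluster model search: a smoothed first-order chain."""
    pair, out = Counter(), Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):
            pair[a, b] += 1
            out[a] += 1
    m = len(alphabet)
    # Log transition probabilities with add-one smoothing.
    return {(a, b): math.log((pair[a, b] + 1) / (out[a] + m))
            for a in alphabet for b in alphabet}


def seq_loglik(model, s):
    # Log-likelihood of one sequence: one transition term per adjacent pair.
    return sum(model[a, b] for a, b in zip(s, s[1:]))


def cluster(seqs, k, alphabet, max_rounds=20, tol=0.0, seed=0):
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in seqs]  # random initial partition
    for _ in range(max_rounds):
        # Fit one model per current subset D_i.
        models = [fit_chain([s for s, c in zip(seqs, assign) if c == i], alphabet)
                  for i in range(k)]
        # Reassign each sequence to the model giving it the highest likelihood.
        new = [max(range(k), key=lambda i: seq_loglik(models[i], s)) for s in seqs]
        moved = sum(a != b for a, b in zip(assign, new)) / len(seqs)
        assign = new
        if moved <= tol:  # stop once the reassigned fraction is negligible
            break
    return assign
```

The result depends on the initial random partition, so in practice several restarts with different seeds may be compared.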

The model inference phase in each iteration is faster than the one using HMMs given the absence of hidden states. Also, in the evaluation phase for each sequence σ, one Viterbi evaluation having cost m²|σ|, where m is the number of states, is turned into a simple summation of 2|σ| terms (one for the transition probability and one for the emission probability of each event).

Embodiments of the present invention may be employed to model data in any of a wide variety of computing contexts. For example, as illustrated in FIG. 4, implementations are contemplated in which the relevant population of users interact with a diverse network environment via any type of computer (e.g., desktop, laptop, tablet, etc.) 402, media computing platforms 403 (e.g., cable and satellite set top boxes and digital video recorders), handheld computing devices (e.g., PDAs) 404, cell phones 406, or any other type of computing or communication platform.

According to various embodiments, user data processed in accordance with the invention may be collected using a wide variety of techniques. For example, collection of data representing a user's interaction with a web site or web-based application or service (e.g., search queries, links selected, the number of page views, etc.) may be accomplished using any of a variety of well-known mechanisms for recording a user's online behavior. User data may be mined directly or indirectly, or inferred from data sets associated with any network or communication system on the Internet. Notwithstanding these examples, it should be understood that such methods of data collection are merely exemplary and that user data may be collected in many ways.

Once collected, the user data may be processed in some centralized manner. This is represented in FIG. 4 by server 408 and data store 410 which, as will be understood, may correspond to multiple distributed devices and data stores. The invention may also be practiced in a wide variety of network environments including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. These networks, as well as the various social networking sites and communication systems from which connection data may be aggregated according to the invention, are represented by network 412.

In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any of a wide variety of computer-readable storage media or memory, and may be executed according to a variety of computing models including a client/server model, a peer-to-peer model, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations and/or different computing platforms.

While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. In addition, although various advantages, aspects, and objects of the present invention have been discussed herein with reference to various embodiments, it will be understood that the scope of the invention should not be limited by reference to such advantages, aspects, and objects. Rather, the scope of the invention should be determined with reference to the appended claims.

1. A computer implemented method for modeling event data using a pre-existing taxonomy of events, the event data representing a plurality of sequences of events, each sequence comprising an order of events initiated by a corresponding user, each event mapping to a leaf node of the taxonomy, the method comprising: identifying a plurality of candidate Markov models, each Markov model representing probabilities of a user transitioning from any first node in the Markov model to any second node in the Markov model according to the sequences of events, each Markov model formed from a subset of nodes in the taxonomy by merging selected nodes of the taxonomy into corresponding ancestor nodes of the taxonomy, wherein each event is represented by a node in each Markov model, and further wherein no Markov model contains both a particular node and an ancestor of that particular node; measuring the fitness of the candidate Markov models with a fitness policy; selecting at least some of the plurality of candidate Markov models with reference to the fitness measure and one or more resource constraints; and choosing a preferred Markov model from the selected candidate Markov models with reference to an objective function.
2. The method of claim 1, wherein the objective function balances a likelihood score of each selected candidate Markov model with a number of states corresponding to the model.
3. The method of claim 2, wherein the objective function comprises the relation $\mathrm{Dist}^{2}(\mathcal{M}) = \left( \frac{s - s_{\min}}{s_{\max} - s_{\min}} \right)^{2} + w \cdot \left( \frac{S_{L}^{*} - S_{L,\min}^{*}}{S_{L,\max}^{*} - S_{L,\min}^{*}} \right)^{2},$ wherein s is the number of states in a candidate Markov model M with likelihood score S_(L)*, s_(min) is a minimum possible number of states in a Markov model achieving a maximum possible likelihood score S_(L,max)*, s_(max) is a maximum possible number of states in a Markov model achieving a minimum possible likelihood score S_(L,min)*, and w is a scaling factor.
4. The method of claim 1, further comprising partitioning the event data into multiple clusters, each cluster yielding a preferred Markov Model, and iteratively adjusting the clusters by reassigning each event sequence to the cluster whose preferred Markov Model maximizes the objective function for that sequence.
5. The method of claim 1 wherein selecting at least some of the candidate Markov models with reference to the fitness measure comprises selecting each selected candidate Markov model according to one of (i) a likelihood score of the selected candidate Markov model, (ii) a minimal number of nodes in the selected candidate Markov model, or (iii) an objective function score on the selected candidate Markov model.
6. The method of claim 1, further comprising displaying advertisements for a user based on a probability represented by the preferred Markov model.
7. The method of claim 1 wherein the event data comprises one of (i) search queries submitted to a search engine, (ii) purchases on an online commerce site, (iii) locations on a map, (iv) pages visited on one or more websites, or (v) user interactions with a software system.
8. A system for modeling event data using a pre-existing taxonomy of events, the event data representing a plurality of sequences of events, each sequence comprising an order of events initiated by a corresponding user, each event mapping to a leaf node of the taxonomy, the system comprising one or more computing devices configured to: identify a plurality of candidate Markov models representing the probabilities of a user transitioning from any first node in the model to any second node in the model according to the sequences of events, each Markov model formed from a subset of nodes in the taxonomy by merging selected nodes of the taxonomy into corresponding ancestor nodes of the taxonomy, wherein each event is represented by a node in each Markov model, further wherein no Markov model contains both a particular node and an ancestor of that particular node; measure the fitness of the candidate Markov models with a fitness policy; select at least some of the plurality of candidate Markov models with reference to the fitness measure and one or more resource constraints; and choose a preferred Markov model from the selected candidate Markov models with reference to an objective function.
9. The system of claim 8, wherein the objective function balances a likelihood score of each selected candidate Markov model with a number of states corresponding to the model.
10. The system of claim 9, wherein the objective function comprises the relation $\mathrm{Dist}^{2}(\mathcal{M}) = \left( \frac{s - s_{\min}}{s_{\max} - s_{\min}} \right)^{2} + w \cdot \left( \frac{S_{L}^{*} - S_{L,\min}^{*}}{S_{L,\max}^{*} - S_{L,\min}^{*}} \right)^{2},$ wherein s is the number of states in a candidate Markov model M with likelihood score S_(L)*, s_(min) is a minimum possible number of states in a Markov model achieving a maximum possible likelihood score S_(L,max)*, s_(max) is a maximum possible number of states in a Markov model achieving a minimum possible likelihood score S_(L,min)*, and w is a scaling factor.
11. The system of claim 8, wherein the one or more computing devices are further configured to partition the event data into multiple clusters, each cluster yielding a preferred Markov Model, and iteratively adjust the clusters by reassigning each event sequence to the cluster whose preferred Markov Model maximizes the objective function for that sequence.
12. The system of claim 8 wherein the one or more computing devices are configured to select at least some of the candidate Markov models with reference to the fitness measure by selecting each selected candidate Markov model according to one of (i) a likelihood score of the selected candidate Markov model, (ii) a minimal number of nodes in the selected candidate Markov model, or (iii) an objective function score on the selected candidate Markov model.
13. The system of claim 8, wherein the one or more computing devices are further configured to display advertisements for a user based on a probability represented by the preferred Markov model.
14. The system of claim 8 wherein the event data comprises one of (i) search queries submitted to a search engine, (ii) purchases on an online commerce site, (iii) locations on a map, (iv) pages visited on one or more websites, or (v) user interactions with a software system.
15. A computer program product for modeling event data using a pre-existing taxonomy of events, the event data representing a plurality of sequences of events, each sequence comprising an order of events initiated by a corresponding user, each event mapping to a leaf node of the taxonomy, comprising at least one computer-readable medium having computer instructions stored therein which are configured to cause one or more computing devices to: identify a plurality of candidate Markov models representing the probabilities of a user transitioning from any first node in the model to any second node in the model according to the sequences of events, each Markov model formed from a subset of nodes in the taxonomy by merging selected nodes of the taxonomy into corresponding ancestor nodes of the taxonomy, wherein each event is represented by a node in each Markov model, further wherein no Markov model contains both a particular node and an ancestor of that particular node; measure the fitness of the candidate Markov models with a fitness policy; select at least some of the plurality of candidate Markov models with reference to the fitness measure and one or more resource constraints; and choose a preferred Markov model from the selected candidate Markov models with reference to an objective function.
16. The computer program product of claim 15, wherein the objective function balances a likelihood score of each selected candidate Markov model with a number of states corresponding to the model.
17. The computer program product of claim 16, wherein the objective function comprises the relation $\mathrm{Dist}^{2}(\mathcal{M}) = \left( \frac{s - s_{\min}}{s_{\max} - s_{\min}} \right)^{2} + w \cdot \left( \frac{S_{L}^{*} - S_{L,\min}^{*}}{S_{L,\max}^{*} - S_{L,\min}^{*}} \right)^{2},$ wherein s is the number of states in a candidate Markov model M with likelihood score S_(L)*, s_(min) is a minimum possible number of states in a Markov model achieving a maximum possible likelihood score S_(L,max)*, s_(max) is a maximum possible number of states in a Markov model achieving a minimum possible likelihood score S_(L,min)*, and w is a scaling factor.
18. The computer program product of claim 15, wherein the computer instructions are further configured to cause the one or more computing devices to partition the event data into multiple clusters, each cluster yielding a preferred Markov Model, and iteratively adjust the clusters by reassigning each event sequence to the cluster whose preferred Markov Model maximizes the objective function for that sequence.
19. The computer program product of claim 15 wherein the computer instructions are further configured to cause the one or more computing devices to select at least some of the candidate Markov models with reference to the fitness measure by selecting each selected candidate Markov model according to one of (i) a likelihood score of the selected candidate Markov model, (ii) a minimal number of nodes in the selected candidate Markov model, or (iii) an objective function score on the selected candidate Markov model.
20. The computer program product of claim 15 wherein the event data comprises one of (i) search queries submitted to a search engine, (ii) purchases on an online commerce site, (iii) locations visited or trajectories taken on a map, (iv) pages visited on a website, or (v) user interactions with a software system or user interface.